Bagging by Design (on the Suboptimality of Bagging)
Abstract
Bagging (Breiman 1996) and its variants are among the most popular methods for aggregating classifiers and regressors. Its original analysis assumed that the bootstraps are built from an unlimited, independent source of samples; we therefore call this form of bagging ideal-bagging. In the real world, however, base predictors are trained on data subsampled from a limited number of training samples, and thus they behave very differently. We analyze the effect of intersections between the bootstraps, obtained by subsampling, that are used to train different base predictors. Most importantly, we provide an alternative subsampling method called design-bagging, based on a new construction of combinatorial designs, and prove it universally better than bagging. Methodologically, we succeed at this level of generality because we compare the prediction accuracy of bagging and design-bagging relative to the accuracy of ideal-bagging. This finds potential applications in more involved bagging-based methods. Our analytical results are backed up by experiments on classification and regression settings.
Introduction
Bootstrapping (Efron 1979) is arguably one of the most significant developments in statistics. Bagging, the “machine learning” analog of bootstrapping, is one of the most influential methods for creating ensembles for classification and regression problems. Many variants and extensions of bagging have been devised since Breiman introduced it in 1996 (Breiman 1996), including the widely studied random forests (Breiman 2001). Despite this spate of work, our mathematical understanding of how accurately bagging and its extensions predict is still limited. In this paper we develop new machinery in this direction and introduce a provably better alternative to bagging.
In theory, ideal-bagging works as follows. A learner φ, given a training set L, produces a predictor (for classification or regression), denoted by φ(·;L).
The learner is provided with independently obtained training sets L_1, L_2, ..., constructs the classifiers, and, given an instance x, decides by voting (or averaging, for regression) over the outputs φ(x;L_1), φ(x;L_2), .... This ideal procedure was originally shown (Breiman 1996) to reduce (in fact, not to increase) the mean-square error compared to a single classifier.
In practice, however, the training samples are limited, and the training sets (aka bootstraps) L_1, L_2, ... are obtained by subsampling from an originally obtained pool of training data L̂. This results in intersections between the L_i's. Many variants of bagging deal heuristically, either directly or indirectly, with the issue of reducing intersections and provide substantial empirical evidence in favor of reducing them, cf. (Ting and Witten 1997). To the best of our knowledge, prior to our work, the effect of intersections on the prediction accuracy was not mathematically understood.
Copyright © 2014, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
Toy example
Here is a very small, made-up example illustrating the effect of intersections; the reader should put this in proportion. Consider a uniform distribution D over 10 points {(0, 0.1), ( 0.01, 0.11), (1, 1.1), (0.99, 0.89), (2, 2.1), (1.99, 1.89), (3, 3.1), (2.99, 2.89), (4, 4.1), (4.99, 4.89)}, which is hidden from us. Use D to sample uniformly a subset of 6 points, forming L̂. Then sample from L̂ three bootstraps (3 subsets) L_1, L_2, L_3, each consisting of 3 elements. On each of the L_i's do linear regression to obtain y
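The setup of the toy example can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the paper's experiment and not the design-bagging construction: the point values are copied as they are printed here, the helper names (`fit_line`, `bagged_predictor`, `total_intersection`, `mse`) and the seed are hypothetical, and the three fitted lines are combined by plain averaging, as in bagging for regression.

```python
import random
from itertools import combinations

# The ten points of the toy distribution D, copied as printed in the text.
D = [(0, 0.1), (0.01, 0.11), (1, 1.1), (0.99, 0.89), (2, 2.1),
     (1.99, 1.89), (3, 3.1), (2.99, 2.89), (4, 4.1), (4.99, 4.89)]


def fit_line(points):
    """Ordinary least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = sxy / sxx
    return a, mean_y - a * mean_x


def bagged_predictor(pool, n_bags=3, bag_size=3, rng=random):
    """Draw n_bags subsets of the pool, fit a line on each, average the lines."""
    bags = [rng.sample(pool, bag_size) for _ in range(n_bags)]
    lines = [fit_line(bag) for bag in bags]

    def predict(x):
        return sum(a * x + b for a, b in lines) / len(lines)

    return predict, bags


def total_intersection(bags):
    """Sum of the pairwise overlaps |L_i ∩ L_j| over all pairs of bags."""
    return sum(len(set(p) & set(q)) for p, q in combinations(bags, 2))


def mse(predict, points):
    """Mean-square error of a predictor over a list of (x, y) points."""
    return sum((predict(x) - y) ** 2 for x, y in points) / len(points)


rng = random.Random(0)        # arbitrary seed, only for reproducibility
pool = rng.sample(D, 6)       # the limited training pool (L-hat in the text)
predict, bags = bagged_predictor(pool, rng=rng)
print("pairwise overlap:", total_intersection(bags))
print("MSE over D:", mse(predict, D))
```

Repeating the last four lines over many seeds and grouping the errors by `total_intersection(bags)` is one way to see empirically how overlap between the subsets relates to accuracy. With three subsets of size 3 drawn from a pool of 6, some overlap is unavoidable.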
Publication date: 2014